-
Maps characters (text) to byte sequences and vice versa. Used for text fields in serialization.
Naming Confusion
-
When someone says "custom package encoding", they usually mean:
-
A framing protocol (how message start/end is delimited).
-
A custom serialization/deserialization strategy.
-
A binary or textual format for transmitting structures over the network.
-
Using "encoding" for package framing strategies is technically valid but potentially ambiguous.
-
In networking, it's better to use more specific terms, such as "framing", "serialization", or "wire format".
-
The word "encoding" itself isn't wrong, but it has to be interpreted in its technical context.
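To make the distinction concrete, here is a minimal sketch that separates the two concerns: character encoding (text to bytes) and framing (where a message starts and ends). The 4-byte big-endian length prefix and the function names `encode_frame` and `decode_frames` are assumptions chosen for illustration, not a standard protocol.

```python
import struct

def encode_frame(text: str) -> bytes:
    """Build one frame: 4-byte big-endian length prefix + UTF-8 payload."""
    payload = text.encode("utf-8")            # character encoding: text -> bytes
    header = struct.pack(">I", len(payload))  # framing: payload length in bytes
    return header + payload

def decode_frames(buffer: bytes) -> list[str]:
    """Split a byte stream into complete frames and decode each payload."""
    messages = []
    offset = 0
    while offset + 4 <= len(buffer):
        (length,) = struct.unpack_from(">I", buffer, offset)
        start = offset + 4
        end = start + length
        if end > len(buffer):
            break  # incomplete frame: wait for more bytes
        messages.append(buffer[start:end].decode("utf-8"))
        offset = end
    return messages

stream = encode_frame("hello") + encode_frame("ol\u00e1")
print(decode_frames(stream))
```

Note that the length prefix counts bytes, not characters: the 3-character second message occupies 4 payload bytes.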
Text
UTF-8
-
Unicode Transformation Format, 8-bit
-
Size:
-
ASCII characters (0 to 127) use 1 byte
-
Non-ASCII characters use 2 to 4 bytes
-
For text dominated by CJK characters (e.g., Chinese, Japanese), it can take more space than UTF-16: most such characters need 3 bytes in UTF-8 but only 2 in UTF-16
-
Web standard (the default or required encoding for HTML, JSON, XML, etc.)
-
Backward compatible with ASCII; valid ASCII text is valid UTF-8
-
Serialization:
-
UTF-8 can be considered a form of serialization for text: it turns a sequence of Unicode code points into a sequence of bytes
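The size rules above can be checked with a short sketch. The sample characters are arbitrary choices for illustration, written as escapes so the code points are explicit.

```python
# One sample per UTF-8 length class (arbitrary illustrative choices).
samples = {
    "A": "A",                   # U+0041, ASCII
    "e-acute": "\u00e9",        # U+00E9, Latin-1 Supplement
    "CJK": "\u4e2d",            # U+4E2D, CJK Unified Ideographs
    "emoji": "\U0001F600",      # U+1F600, outside the BMP
}

utf8_sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_sizes)

# ASCII compatibility: pure-ASCII text yields identical bytes in both encodings.
ascii_text = "Hello, world!"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```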
UTF-16
-
Size:
-
BMP characters (Basic Multilingual Plane, U+0000 to U+FFFF) use 2 bytes
-
Characters outside BMP (e.g., emojis, historical scripts) use 4 bytes (surrogate pairs)
-
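More compact than UTF-8 for text dominated by BMP characters above U+07FF (e.g., Chinese, Japanese, Korean), which take 2 bytes each instead of 3; ASCII-heavy text, by contrast, doubles in size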
-
Widely used in some APIs and programming languages (e.g., Java, Windows, .NET)
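As a rough illustration of the UTF-16 sizes and of surrogate pairs, the sketch below uses the `utf-16-le` codec so that no byte order mark is added to the output. The sample strings are arbitrary.

```python
import struct

def utf16_size(text: str) -> int:
    """Size in bytes when encoded as UTF-16 little-endian, without a BOM."""
    return len(text.encode("utf-16-le"))

bmp_char = "\u4e2d"       # U+4E2D, inside the BMP
emoji = "\U0001F600"      # U+1F600, outside the BMP

print(utf16_size(bmp_char), utf16_size(emoji))

# A non-BMP code point is stored as two 16-bit code units (a surrogate pair).
high, low = struct.unpack("<2H", emoji.encode("utf-16-le"))
print(hex(high), hex(low))

# Size comparison against UTF-8 for CJK-heavy and ASCII-only text.
cjk_text = "\u65e5\u672c\u8a9e"   # three BMP characters
ascii_text = "hello"
sizes = {
    "cjk": (len(cjk_text.encode("utf-8")), utf16_size(cjk_text)),
    "ascii": (len(ascii_text.encode("utf-8")), utf16_size(ascii_text)),
}
print(sizes)  # (UTF-8 bytes, UTF-16 bytes) per sample
```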
UTF-32
-
Size: Every code point uses exactly 4 bytes, which makes indexing and manipulation by code point easier, at the cost of space (4 times the size of UTF-8 for ASCII text)
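A small sketch of why fixed-width code units simplify indexing: the n-th code point always starts at byte offset 4 * n. The helper name `code_point_at` is hypothetical, and `utf-32-le` is used so that no byte order mark is prepended.

```python
def code_point_at(data: bytes, index: int) -> str:
    """Return the code point at `index` from UTF-32-LE bytes in constant time."""
    start = 4 * index
    return chr(int.from_bytes(data[start:start + 4], "little"))

text = "a\u4e2d\U0001F600"          # ASCII, BMP, and non-BMP code points
data = text.encode("utf-32-le")

print(len(data))                     # 4 bytes per code point
print(code_point_at(data, 2) == "\U0001F600")

# Caveat: fixed width applies to code points, not user-perceived characters.
# "e" followed by a combining acute accent renders as one character
# but is two code points, so it occupies 8 bytes.
combined = "e\u0301"
combined_size = len(combined.encode("utf-32-le"))
print(combined_size)
```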
ASCII
-
American Standard Code for Information Interchange
-
Legacy system compatibility: For old systems or devices that only support ASCII
-
Simple English text: When text contains only basic characters (A to Z letters, 0 to 9 digits, basic punctuation)
-
Simplicity: ASCII is a 7-bit code (128 characters) that is stored as exactly 1 byte per character, simplifying processing in very basic systems
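The sketch below shows one way to check whether text fits in ASCII before sending it to an ASCII-only system. The helper name `is_ascii` is hypothetical; on Python 3.7+ the built-in `str.isascii()` does the same job.

```python
def is_ascii(text: str) -> bool:
    """True if every code point is in the 7-bit ASCII range (0 to 127)."""
    return all(ord(ch) < 128 for ch in text)

plain = "Hello, 123!"
accented = "caf\u00e9"

plain_bytes = plain.encode("ascii")
print(len(plain_bytes) == len(plain))      # 1 byte per character
print(max(plain_bytes) < 0x80)             # high bit is never set

# Encoding non-ASCII text with the ASCII codec raises an error.
try:
    accented.encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(is_ascii(plain), is_ascii(accented), encode_failed)
```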